Abstract

Dimensionality reduction (DR) methods are commonly used as a principled way to understand high-dimensional data
such as facial images. In this paper, we propose a new supervised DR method called Optimized Projection for Sparse
Representation based Classification (OP-SRC), which builds on a recent face recognition method, Sparse Representation
based Classification (SRC). SRC seeks a sparse linear combination of all the training data for a given query image and
makes its decision by the minimal reconstruction residual. OP-SRC is designed around the decision rule of SRC: it aims to
reduce the within-class reconstruction residual and simultaneously increase the between-class reconstruction residual
on the training data. The projections are optimized to match the mechanism of SRC, so SRC performs well in the OP-SRC
transformed space. The feasibility and effectiveness of the proposed method are verified on the Yale, ORL and UMIST
databases with promising results.
1. Introduction

In many application domains, such as appearance-based object recognition, information retrieval and text
categorization, the data are usually provided in high-dimensional form. One of the resulting problems is the
so-called "curse of dimensionality" [1], a well-known but not entirely well-understood phenomenon: a limited
amount of data lies in a high-dimensional space, while only a few features are actually important. Moreover, it
has been observed that a large number of features may actually degrade the performance of classifiers if the
number of training samples is small relative to the number of features [2]. Consequently, dimensionality
reduction is essential not only to engineering applications but also to the design of classifiers. In fact, the
design of a classifier becomes extremely simple if all patterns of the same class share the same feature vector
while patterns of different classes have different feature vectors.

Up to now, a large family of algorithms has been designed to provide different solutions to the problem of DR.
Among them, the linear algorithms Principal Component Analysis (PCA) [?] and Linear Discriminant Analysis (LDA) [?]
have been the two most popular methods due to their relative simplicity and effectiveness. However, PCA and LDA
consider only the global scatter of the training samples and fail to reveal the essential data structures
nonlinearly embedded in a high-dimensional space. To overcome these limitations, manifold learning methods were
proposed, which assume that the data lie on a low-dimensional manifold of the high-dimensional space [3].
Locality Preserving Projection (LPP) [?] is one of the representative manifold learning methods. The success of
manifold learning implies that high-dimensional facial images can be sparsely represented or coded by
representative samples on the manifold. Very recently, Wright et al. presented a Sparse Representation based
Classification (SRC) method for face recognition [?]. The main idea of SRC is to represent a given test sample as
a sparse linear combination of all training samples; the nonzero sparse representation coefficients are supposed
to concentrate on the training samples with the same class label as the test sample. SRC shows that the
classification performance of most meaningful features converges as the feature dimension increases if an SRC
classifier is used. Although this provides some new insights into the role of feature extraction in pattern
classification tasks, Qiao et al. [?] argued that designing an effective and efficient feature extractor is still
of great importance, since the classification algorithm could then become simple and tractable. They proposed an
unsupervised DR method called Sparsity Preserving Projections (SPP), which aims to preserve the sparse
reconstructive relationship of the data in a low-dimensional subspace. Yang and Chu [9] proposed a Sparse
Representation Classifier steered Discriminative Projection (SRC-DP) method, which uses the decision rule of SRC
to steer the design of a dimensionality reduction method. SRC-DP iteratively obtains the projection matrix and
the sparse coding coefficients of each training sample. However, the convergence of SRC-DP is unclear, and the
method is time consuming due to the high computational cost of iterative sparse coding.

In this paper, to enhance the recognition performance of SRC, we propose a supervised DR method based on sparse
representation, named the Optimized Projection for Sparse Representation based Classification (OP-SRC). Similar
to SRC-DP, OP-SRC aims to obtain a discriminative projection such that SRC achieves its optimum performance in
the transformed low-dimensional space. Since SRC predicts the class label of a given test sample based on the
representational residual, OP-SRC utilizes the label information to make the residuals more informative. We will
also show that OP-SRC is naturally orthogonal, which may help preserve the shape of the data distribution.

The remainder of this paper is organized as follows. Section 2 reviews the SRC algorithm. Section 3 presents the
OP-SRC method. Section 4 presents the experimental results on several databases, together with some discussion.
Finally, we conclude this paper in Section 5.

2. Sparse Representation based Classification 

Given sufficient training samples from $c$ classes, a basic problem in pattern recognition is to correctly
determine the class to which a new (test) sample belongs. We arrange the $n_i$ training samples from the $i$-th
class as columns of a matrix $X_i = [x_{i1}, \cdots, x_{in_i}] \in \mathbb{R}^{m \times n_i}$, where $m$ is the
dimension. Then we obtain the training sample matrix $X = [X_1, \cdots, X_c]$, where $n = \sum_{i=1}^{c} n_i$ is
the total number of training samples.

Under the assumption of linear representation, a test sample $y \in \mathbb{R}^m$ will approximately lie in the
linear subspace spanned by the training samples:

$$y = X\alpha \in \mathbb{R}^m. \tag{1}$$

If $m < n$, the system of Eq. (1) is underdetermined and its solution is not unique. This motivates us to seek
the sparsest solution to Eq. (1) by solving the following $\ell^0$-minimization problem:

$$(\ell^0): \quad \hat{\alpha}_0 = \arg\min \|\alpha\|_0 \quad \text{subject to} \quad y = X\alpha, \tag{2}$$

where $\|\cdot\|_0$ denotes the $\ell^0$-norm, which counts the number of nonzero entries in a vector. However,
the problem of finding the sparsest solution of an underdetermined system of linear equations is NP-hard and
difficult even to approximate [?]. The theory of compressive sensing [?] reveals that if the solution to the
$\ell^0$-minimization problem is sparse enough, then it is equal to the solution of the following
$\ell^1$-minimization problem:

$$(\ell^1): \quad \hat{\alpha}_1 = \arg\min \|\alpha\|_1 \quad \text{subject to} \quad y = X\alpha. \tag{3}$$

In order to deal with occlusion, the $\ell^1$-minimization problem is extended to the stable
$\ell^1$-minimization problem as follows:

$$(\ell^1): \quad \hat{\alpha}_1 = \arg\min \|\alpha\|_1 \quad \text{subject to} \quad \|y - X\alpha\|_2 \le \varepsilon, \tag{4}$$

where $\varepsilon$ is a given tolerance.

For a given test sample $y$, SRC first computes its sparse representation coefficient $\hat{\alpha}_1$ by solving
the $\ell^1$-minimization problem (3) or (4), and then determines the class of this test sample from the
reconstruction error between the test sample and the training samples of each class $i$:

$$r_i(y) = \|y - X\delta_i(\hat{\alpha}_1)\|_2, \tag{5}$$

where $\delta_i(\cdot)$ keeps only the coefficients associated with the $i$-th class and sets the others to zero.
Then the class $C(y)$ to which the test sample $y$ belongs is determined by

$$C(y) = \arg\min_i r_i(y). \tag{6}$$
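
The decision rule in Eqs. (5) and (6) can be sketched in a few lines of NumPy. This is only a minimal illustration and not the solver used in the paper: instead of the constrained problem (4), it assumes an unconstrained Lasso relaxation solved by plain ISTA, and the names `ista_lasso`, `src_classify` and the parameter `lam` are hypothetical.

```python
import numpy as np

def ista_lasso(X, y, lam=0.01, n_iter=500):
    """Approximate argmin_a 0.5*||y - X a||_2^2 + lam*||a||_1 by ISTA.

    Assumption: this Lasso form stands in for the constrained problem (4).
    """
    step = 1.0 / (np.linalg.norm(X, 2) ** 2)    # 1 / Lipschitz constant of the gradient
    a = np.zeros(X.shape[1])
    for _ in range(n_iter):
        z = a - step * (X.T @ (X @ a - y))       # gradient step on the quadratic term
        a = np.sign(z) * np.maximum(np.abs(z) - lam * step, 0.0)  # soft-thresholding
    return a

def src_classify(X, labels, y, lam=0.01):
    """Return the predicted class of y and the per-class residuals r_i(y)."""
    labels = np.asarray(labels)
    a = ista_lasso(X, y, lam)
    classes = np.unique(labels)
    # delta_i(a): keep only the coefficients that belong to class i
    residuals = np.array([
        np.linalg.norm(y - X @ np.where(labels == c, a, 0.0)) for c in classes
    ])
    return classes[np.argmin(residuals)], residuals
```

Here `X` holds the training samples as columns and `labels` gives the class of each column; the class with the smallest residual is returned, as in Eq. (6).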


SRC is robust to noise and performs well for face recognition; it has attracted much attention in recent years
and has boosted research on sparsity based machine learning. Elhamifar and Vidal [?] proposed a more robust
classification method using structured sparse representation, while Gao et al. [14] introduced a kernel version
of SRC. In [?], the $\ell^1$-graph was established by sparsely coding each sample over the other samples for
clustering. In this paper, however, we focus on the sparse representation based dimensionality reduction problem
rather than on extensions of SRC. A discriminative learning method is presented in the next section.


3. Optimized Projection for Sparse Representation based Classification

In this section, we consider the supervised DR problem. Consider a training sample $x$ (belonging to the $i$-th
class) and its sparse representation coefficient $\alpha$ computed with the other training samples as a
dictionary. Ideally, the entries of $\alpha$ are zero except those associated with the $i$-th class. In many
practical face recognition scenarios, however, the training sample $x$ could be partially corrupted or occluded,
or the training samples may not be sufficient to represent the given sample. In these cases, the residual
associated with the $i$-th class, $r_i(x)$, may not be small enough and may produce an erroneous prediction.
Thus, the Optimized Projection for Sparse Representation based Classification (OP-SRC) is proposed, which aims to
seek a linear projection matrix such that, in the transformed low-dimensional space, the within-class
reconstruction residual is as small as possible and simultaneously the between-class reconstruction residual is
as large as possible.

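Let $P \in \mathbb{R}^{m \times d}$ be the optimized projection matrix with $d \ll m$. The data matrix in the
original input space $\mathbb{R}^m$ is mapped into a $d$-dimensional space $\mathbb{R}^d$, that is,
$Y = P^T X$. For each training sample $y_{ij} = P^T x_{ij}$ from $Y$ in the transformed $d$-dimensional space
$\mathbb{R}^d$, by solving the extended $\ell^1$-minimization problem (4), we obtain its sparse coding
coefficient $\alpha_{ij}$ by using the remaining training samples as a dictionary. Based on the decision rule of
SRC, we define the within-class residual matrix as follows:

$$\tilde{R}_W = \frac{1}{n} \sum_{i=1}^{c} \sum_{j=1}^{n_i} (y_{ij} - Y\delta_i(\alpha_{ij}))(y_{ij} - Y\delta_i(\alpha_{ij}))^T. \tag{7}$$

The between-class residual matrix is defined as follows:

$$\tilde{R}_B = \frac{1}{n(c-1)} \sum_{i=1}^{c} \sum_{j=1}^{n_i} \sum_{l \neq i} (y_{ij} - Y\delta_l(\alpha_{ij}))(y_{ij} - Y\delta_l(\alpha_{ij}))^T. \tag{8}$$

The total residual matrix is defined as follows:

$$\tilde{R}_T = \frac{n\tilde{R}_W + n(c-1)\tilde{R}_B}{nc} \tag{9}$$

$$= \frac{1}{nc} \sum_{i=1}^{c} \sum_{j=1}^{n_i} \sum_{l=1}^{c} (y_{ij} - Y\delta_l(\alpha_{ij}))(y_{ij} - Y\delta_l(\alpha_{ij}))^T. \tag{10}$$

For each class $i$, $\delta_i(\alpha): \mathbb{R}^n \to \mathbb{R}^n$ is the characteristic function which
selects the coefficients associated with the $i$-th class.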
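
As a rough illustration of Eqs. (7) and (8), the following sketch accumulates the within-class and between-class residual matrices from given sparse codes. It assumes that the code of the k-th sample has already been computed over the remaining samples and is stored as column `A[:, k]` in the full n-dimensional index space (so `A[k, k]` is zero), and that there are at least two classes. The function name `residual_matrices` is hypothetical.

```python
import numpy as np

def residual_matrices(Y, labels, A):
    """Y: (d, n) samples as columns; labels: (n,); A: (n, n) sparse codes,
    where column k codes sample k over the other samples (A[k, k] == 0)."""
    labels = np.asarray(labels)
    classes = np.unique(labels)
    d, n = Y.shape
    c = len(classes)                      # assumed to be >= 2
    R_w = np.zeros((d, d))
    R_b = np.zeros((d, d))
    for k in range(n):
        for cls in classes:
            coef = np.where(labels == cls, A[:, k], 0.0)   # delta_l(alpha_k)
            e = Y[:, k] - Y @ coef                          # class-wise residual vector
            if cls == labels[k]:
                R_w += np.outer(e, e)                       # within-class term, Eq. (7)
            else:
                R_b += np.outer(e, e)                       # between-class terms, Eq. (8)
    return R_w / n, R_b / (n * (c - 1))
```

Calling the same function on the untransformed samples gives the input-space counterparts of these matrices.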
Figure 1: Samples of two subjects from the Yale database.

Table 1: Mean recognition rates and standard deviations on the Yale database (optimal dimension in parentheses).

Methods   4 Train               5 Train               6 Train               7 Train
PCA       0.6467 ± 0.044 (52)   0.671 ± 0.029 (64)    0.725 ± 0.042 (88)    0.721 ± 0.055 (64)
LDA       0.717 ± 0.057 (14)    0.752 ± 0.039 (14)    0.799 ± 0.046 (14)    0.814 ± 0.046 (14)
SPP       0.607 ± 0.049 (57)    0.638 ± 0.045 (72)    0.676 ± 0.044 (88)    0.702 ± 0.046 (104)
SRC-DP    0.706 ± 0.049 (29)    0.724 ± 0.035 (37)    0.771 ± 0.042 (34)    0.773 ± 0.043 (43)
OP-SRC    0.758 ± 0.048 (48)    0.794 ± 0.036 (62)    0.833 ± 0.038 (74)    0.853 ± 0.048 (88)


To make SRC perform well on the training data, we expect the within-class residual to be as small as possible
and, simultaneously, the between-class residual to be as large as possible. Therefore, we can choose to maximize
the following criterion [?]:

$$J(P) = \mathrm{tr}(\beta\tilde{R}_B - \tilde{R}_W), \tag{11}$$

where $\beta$ is the weight parameter which balances the between-class and within-class residual information.
Since $P$ is a linear mapping, it is easy to show that $\tilde{R}_W = P^T R_W P$ and $\tilde{R}_B = P^T R_B P$,
where

$$R_W = \frac{1}{n} \sum_{i=1}^{c} \sum_{j=1}^{n_i} (x_{ij} - X\delta_i(\alpha_{ij}))(x_{ij} - X\delta_i(\alpha_{ij}))^T, \tag{12}$$

$$R_B = \frac{1}{n(c-1)} \sum_{i=1}^{c} \sum_{j=1}^{n_i} \sum_{l \neq i} (x_{ij} - X\delta_l(\alpha_{ij}))(x_{ij} - X\delta_l(\alpha_{ij}))^T. \tag{13}$$

So, we have

$$J(P) = \mathrm{tr}(P^T(\beta R_B - R_W)P). \tag{14}$$

In order to avoid degenerate solutions, we additionally require that $P$ is constituted by unit vectors, i.e.
$P = [p_1, \cdots, p_d]$ and $p_k^T p_k = 1$, $k = 1, \cdots, d$. One may use other constraints. For example, we
can require $\mathrm{tr}(P^T R_W P) = 1$ and then maximize $\mathrm{tr}(P^T R_B P)$. The motivation for using the
constraint $p_k^T p_k = 1$ is that it results in an orthogonal projection, which may help preserve the shape of
the data distribution. Thus, the objective function can be recast as the following optimization problem:

$$\max \sum_{k=1}^{d} p_k^T(\beta R_B - R_W)p_k \quad \text{subject to} \quad p_k^T p_k = 1, \; k = 1, \cdots, d. \tag{15}$$

We can use Lagrange multipliers to transform the above objective function to include the constraint:

$$L(p_k, \lambda_k) = \sum_{k=1}^{d} p_k^T(\beta R_B - R_W)p_k - \lambda_k(p_k^T p_k - 1). \tag{16}$$

The optimization is performed by setting the partial derivative of $L$ with respect to $p_k$ to zero:

$$\frac{\partial L}{\partial p_k} = 0 \;\Rightarrow\; (\beta R_B - R_W - \lambda_k I)p_k = 0, \quad k = 1, \cdots, d. \tag{17}$$

Now we obtain

$$(\beta R_B - R_W)p_k = \lambda_k p_k, \quad k = 1, \cdots, d, \tag{18}$$

which means that the $\lambda_k$'s are eigenvalues of $\beta R_B - R_W$ and the $p_k$'s are the corresponding
eigenvectors. Thus

$$J(P) = \sum_{k=1}^{d} p_k^T(\beta R_B - R_W)p_k = \sum_{k=1}^{d} \lambda_k p_k^T p_k = \sum_{k=1}^{d} \lambda_k. \tag{19}$$

Therefore, $J(P)$ is maximized when $P$ is composed of the eigenvectors of $\beta R_B - R_W$ associated with its
$d$ largest eigenvalues.
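
The closed-form solution can be sketched with a symmetric eigendecomposition. The snippet below is an illustrative sketch only: the function name `op_src_projection` is hypothetical, the residual matrices are assumed to be precomputed in the input space, and the default `beta=0.25` simply mirrors the value reported in the experiments.

```python
import numpy as np

def op_src_projection(R_w, R_b, d, beta=0.25):
    """Return P (m x d), the eigenvectors of beta*R_b - R_w belonging to the
    d largest eigenvalues, together with those eigenvalues."""
    M = beta * R_b - R_w
    M = (M + M.T) / 2.0                      # symmetrize against round-off
    eigvals, eigvecs = np.linalg.eigh(M)     # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:d]    # indices of the d largest eigenvalues
    return eigvecs[:, order], eigvals[order]
```

Because `numpy.linalg.eigh` returns orthonormal eigenvectors for a symmetric matrix, the columns of the returned `P` are orthonormal, and the sum of the returned eigenvalues equals the attained value of the criterion.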

The solution of the optimization problem (15) has the following property:

Proposition 1. The columns of the optimal solution $P$ to the optimization problem (15) are orthogonal, that is,
$p_i^T p_j = 0$ for any $i \neq j$, and $p_i^T p_i = 1$.

It is easy to prove the orthogonality of the solution $P$ from the symmetry of $\beta R_B - R_W$. Thus, OP-SRC is
a supervised orthogonal projection method, which may preserve more discriminative information for
classification, especially for the SRC method.
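
The orthogonality claim follows from a standard argument for symmetric matrices. A short derivation is sketched below, where $M$ is shorthand introduced here for $\beta R_B - R_W$ (it is not a symbol used elsewhere in the paper):

```latex
% M = \beta R_B - R_W is symmetric because R_B and R_W are sums of outer products.
Let $M p_i = \lambda_i p_i$ and $M p_j = \lambda_j p_j$ with $M^T = M$. Then
\[
  \lambda_i \, p_j^T p_i = p_j^T M p_i = (M p_j)^T p_i = \lambda_j \, p_j^T p_i
  \quad\Longrightarrow\quad (\lambda_i - \lambda_j)\, p_j^T p_i = 0 .
\]
Hence $p_i^T p_j = 0$ whenever $\lambda_i \neq \lambda_j$. For a repeated eigenvalue, an
orthonormal basis of the corresponding eigenspace can be chosen, and $p_i^T p_i = 1$
holds by the constraint in (15).
```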
Figure 2: Accuracy rates versus reduced dimensions on the Yale database: (a) 4 Train; (b) 5 Train; (c) 6 Train; (d) 7 Train.


4. Experimental Verification 

In this section, we investigate the performance of our proposed OP-SRC method for face representation and
recognition. The system performance is compared with PCA [?], LDA [?], SPP [?] and SRC-DP [9]. PCA and LDA are
two of the most popular linear methods in face recognition. SPP and SRC-DP are two recent methods based on sparse
representation. As with SPP and SRC-DP, we first perform PCA to reduce the dimension before applying OP-SRC.
Finally, SRC is employed for classification.
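
The PCA preprocessing step can be sketched as follows. This is an assumption-laden illustration: the paper does not state whether the data are mean-centered before PCA, so centering is assumed here, and the names `pca_fit` and `project` are hypothetical.

```python
import numpy as np

def pca_fit(X, k):
    """X: (m, n) samples as columns. Returns the sample mean and the top-k principal axes."""
    mean = X.mean(axis=1, keepdims=True)                 # assumed centering step
    U, _, _ = np.linalg.svd(X - mean, full_matrices=False)
    return mean, U[:, :k]

def project(X, mean, U, P=None):
    """Apply PCA, then optionally a learned projection P: Z = P^T U^T (X - mean)."""
    Z = U.T @ (X - mean)
    return Z if P is None else P.T @ Z
```

The same `mean`, `U` and `P` learned on the training set would be applied to the test samples before they are passed to SRC.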

4.1. Data Sets and Experimental Settings 

We test our proposed method on three popular face databases: Yale [3], ORL [?] and UMIST [?]. The databases
contain wide-ranging variations, including pose, illumination, and gesture alterations. For these databases, we
randomly select part of the images per class for training (i.e. 4, 5, 6 and 7 of 11 images per subject for Yale;
4, 5, 6 and 7 of 10 images per subject for ORL; and 6, 8, 10 and 12 of about 29 images per subject for UMIST),
and use the remainder for testing. In particular, with the given training set, the projection $P$ is learned by
PCA, LDA, SPP, SRC-DP and OP-SRC*, respectively, and the test samples are subsequently transformed by the learned
projection. A specific classifier is then employed to evaluate the recognition rates on the test data; SRC is
used in this paper.

In the experiments, the images are cropped to a size of 32 × 32, and the gray level values of all images are
rescaled to [0, 1]. Twenty random training/test splits are generated, and the average classification accuracies
over these splits are reported in the tables and figures.
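
The splitting protocol described above might be implemented along the following lines. This is a sketch under stated assumptions: the helper names `random_split` and `rescale_gray` are hypothetical, the seed is arbitrary, the label layout only imitates the Yale setting (15 subjects with 11 images each), and dividing 8-bit gray levels by 255 is one possible way of rescaling to [0, 1].

```python
import numpy as np

def random_split(labels, n_train, rng):
    """Pick n_train samples per class for training; the rest are used for testing."""
    labels = np.asarray(labels)
    train_idx = []
    for cls in np.unique(labels):
        idx = np.flatnonzero(labels == cls)
        train_idx.extend(rng.choice(idx, size=n_train, replace=False))
    train_idx = np.sort(np.array(train_idx))
    test_idx = np.setdiff1d(np.arange(labels.size), train_idx)
    return train_idx, test_idx

def rescale_gray(images):
    """Map 8-bit gray levels to [0, 1] (assumes values in 0..255)."""
    return np.asarray(images, dtype=float) / 255.0

rng = np.random.default_rng(0)            # seed chosen only for reproducibility
labels = np.repeat(np.arange(15), 11)     # Yale-like layout: 15 subjects x 11 images
splits = [random_split(labels, 4, rng) for _ in range(20)]
```

Averaging the accuracy obtained on each of the 20 `(train_idx, test_idx)` pairs would give the mean and standard deviation reported per setting.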

The SPAMS package [?] is used for solving the extended $\ell^1$-minimization problem (4). In our experiments, we
empirically set $\varepsilon = 0.05$ (refer to (4)), which usually leads SRC to better performance than other
values, and set $\beta = 0.25$ (refer to (14)) by searching over a large range of candidates.

4.2. Yale Database 

The Yale database contains 165 gray scale images of 15 individuals. It was constructed at the Yale Center for
Computational Vision and Control. The images demonstrate variations in lighting condition and facial expression
(normal, happy, sad, sleepy, surprised, and wink). Figure 1 shows some samples of two subjects from the Yale
database. A random subset with $l$ (= 4, 5, 6, 7) images per individual is taken with labels to form the training
set, and the rest of the database is considered to be the test set. For each given $l$, we average the
recognition accuracy over 20 random splits. Notice that LDA differs from the other methods because its maximal
number of dimensions is less than the number of classes $c$ [3].

*The Matlab code can be found on our homepage:
http://mail.ustc.edu.cn/~canyilu/

Figure 3: Samples of two subjects from the ORL database.

Table 2: Mean recognition rates and standard deviations on the ORL database (optimal dimension in parentheses).

Methods   4 Train               5 Train               6 Train               7 Train
PCA       0.898 ± 0.019 (127)   0.921 ± 0.018 (183)   0.941 ± 0.018 (193)   0.954 ± 0.023 (134)
LDA       0.899 ± 0.019 (39)    0.930 ± 0.017 (39)    0.941 ± 0.019 (39)    0.950 ± 0.023 (39)
SPP       0.861 ± 0.018 (108)   0.887 ± 0.026 (170)   0.903 ± 0.031 (180)   0.922 ± 0.026 (202)
SRC-DP    0.888 ± 0.018 (124)   0.918 ± 0.018 (131)   0.929 ± 0.028 (221)   0.943 ± 0.022 (190)
OP-SRC    0.925 ± 0.017 (153)   0.950 ± 0.017 (195)   0.968 ± 0.015 (224)   0.975 ± 0.013 (255)

In general, the performance of all these methods varies with the number of dimensions. We show the best results
and the optimal dimensions obtained by PCA, LDA, SPP, SRC-DP and OP-SRC in Table 1, including the mean accuracies
as well as the standard deviations.

From Table 1, it can be seen that OP-SRC obtains the highest recognition rates in all cases. Figure 2 shows the
plots of accuracy rates versus reduced dimensions. Note that, as the feature dimension continues to increase, the
performance of the OP-SRC algorithm decreases and reaches the same accuracy as PCA at the highest dimension. In
this case, the obtained optimized projection matrix $P$ is square and orthogonal, that is,
$P^T P = P P^T = I$. Thus, $\|P^T x - P^T X\alpha\|_2 = \|x - X\alpha\|_2$. The sparse representation coefficient
in the transformed space will therefore be the same as in the subspace projected by PCA, and the two methods
always obtain the same recognition result.
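
The invariance argument can be checked numerically. The sketch below draws a random square orthogonal matrix via a QR factorization (an arbitrary choice made only for illustration) and compares the two residual norms; all variable names and sizes are hypothetical stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 6, 10
X = rng.standard_normal((m, n))      # stand-in dictionary (e.g. PCA features)
x = rng.standard_normal(m)           # stand-in sample
alpha = rng.standard_normal(n)       # any coefficient vector

# Q factor of a random square matrix: a square orthogonal matrix
P, _ = np.linalg.qr(rng.standard_normal((m, m)))

lhs = np.linalg.norm(P.T @ x - P.T @ X @ alpha)   # residual norm after projection
rhs = np.linalg.norm(x - X @ alpha)               # residual norm before projection
```

Since the constraint set and the objective of problem (4) depend on the data only through such residual norms, a square orthogonal projection leaves the sparse coding problem unchanged.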

4.3. ORL Database 

The ORL database consists of 10 face images of each of 40 subjects, for a total of 400 images, with some
variations in pose, facial expression and details. Some images were captured at different times and show
different variations, including expression (open or closed eyes, smiling or non-smiling) and facial details
(glasses or no glasses). The images were taken with a tolerance for some tilting and rotation of the face of up
to 20 degrees. Figure 3 shows some samples of two subjects from the ORL database. A random subset with $l$
(= 4, 5, 6, 7) images per individual is taken with labels to form the training set. The rest of the database is
considered to be the test set. The experimental protocol is the same as that on the Yale database. The
recognition results are shown in Table 2 and Figure 4.

From Table 2 and Figure 4, we find that most dimensionality reduction methods perform well, since the variation
of faces in the ORL database is limited. PCA is even more accurate than SRC-DP, which is supervised. If the
number of training samples is small, i.e. 4 or 5 samples per subject, OP-SRC also performs worse than PCA in
low-dimensional spaces, but much better in high-dimensional spaces. It seems that OP-SRC may overfit on the ORL
database in low-dimensional spaces with limited training data.

4.4. UMIST Database

The UMIST database contains 564 images of 20 individuals, each covering a range of poses from profile to frontal
views. The subjects cover a range of race, sex and appearance. We use a cropped version of the UMIST database
that is publicly available at S. Roweis' web page†. Figure 5 shows some images of two subjects from the UMIST
database. We randomly select $l$ (= 6, 8, 10, 12) images from each individual for training, and use the rest for
testing.

Table 3 gives the best classification accuracy rates and the corresponding standard deviations of the five
algorithms under different sizes of the training set. Figure 6 plots the recognition rates of the five algorithms
under different reduced dimensions when the number of training samples from each class is 6, 8, 10 and 12,
respectively. From Table 3 and Figure 6, we find that OP-SRC outperforms the other methods across different
dimensions and different numbers of training samples.


†http://cs.nyu.edu/~roweis/data.html
Figure 4: Accuracy rates versus reduced dimensions on the ORL database: (a) 4 Train; (b) 5 Train; (c) 6 Train; (d) 7 Train.


4.5. Discussions 

Based on the results on the Yale, ORL and UMIST databases, we make the following observations:

1. OP-SRC always outperforms PCA, SPP and SRC-DP on the Yale and UMIST databases, and is also more accurate
than PCA when the subspace dimension exceeds a certain threshold on the ORL database. OP-SRC even performs
better than LDA in low-dimensional subspaces on the ORL and UMIST databases. The top average recognition
rates of OP-SRC are higher than those of PCA, SPP, LDA and SRC-DP on all three databases. The superiority
of OP-SRC comes from its orthogonality and from how well it matches the SRC algorithm.

2. Similar to other dimensionality reduction methods, the recognition accuracy of OP-SRC first increases with
the number of dimensions, but eventually decreases and reaches the same result as PCA at the highest
dimension. This is because the data are first projected onto a PCA subspace, and the $\ell^2$-norm is
invariant to the orthogonal OP-SRC projection at the highest dimension.

3. From our experiments, we also find that OP-SRC is more efficient than SPP and SRC-DP, which are both sparse
coding based methods. It is therefore more practical for real applications.


5. Conclusions 

In this paper, based on sparse representation, we propose a new algorithm called Optimized Projection for Sparse
Representation based Classification (OP-SRC) for supervised dimensionality reduction. The optimized projection
decreases the within-class reconstruction residual and simultaneously increases the between-class reconstruction
residual, which, in theory, matches the decision rule of SRC. OP-SRC is orthogonal, which may help preserve more
discriminative information for classification. The experimental results on the three face databases demonstrate
that the proposed OP-SRC performs better than PCA, LDA, SPP and SRC-DP, and that it is more effective for sparse
representation based classification.
Figure 5: Samples of two subjects from the UMIST database.

Table 3: Mean recognition rates (%) and standard deviations on the UMIST database (optimal dimension in parentheses).

Methods   6 Train              8 Train              10 Train             12 Train
PCA       88.35 ± 2.32 (105)   92.48 ± 3.13 (125)   95.92 ± 1.29 (110)   96.93 ± 1.84 (85)
LDA       83.54 ± 1.82 (15)    86.58 ± 3.17 (15)    91.15 ± 1.26 (15)    92.18 ± 1.68 (15)
SPP       83.08 ± 2.69 (80)    87.25 ± 2.64 (105)   91.17 ± 2.13 (135)   90.45 ± 2.78 (155)
SRC-DP    85.63 ± 2.20 (75)    89.42 ± 2.73 (105)   93.28 ± 1.50 (120)   93.07 ± 2.48 (130)
OP-SRC    89.41 ± 1.93 (115)   93.93 ± 2.98 (105)   97.44 ± 1.19 (105)   98.00 ± 1.57 (120)

Figure 6: Accuracy rates versus reduced dimensions on the UMIST database: (a) 6 Train; (b) 8 Train; (c) 10 Train; (d) 12 Train.